Interconnecting lexicographic resources
In search for a model

Dan Cristea
“Alexandru Ioan Cuza” University of Iași
Institute of Computer Science of the Romanian Academy
dcristea@info.uaic.ro

Topics
• Why would one want to connect linguistic resources?
• Parameterising the needs
• Standardisation helps interconnecting
• A bunch of notorious resources
• How would this work?
• Final remarks

COST-ENeL, Bled, 29-30 September 2014

Why would one want to connect linguistic resources?
• Use case 1: 100 Romanian dictionaries aligned
  – CLRE

CLRE: Essential Romanian Lexicographical Corpus
100 dictionaries aligned at entry and, partially, sense levels (2010–2013, at the Institute “A. Philippide” of the Romanian Academy, in Iași)
• dictionaries’ list at: http://85.122.23.90/resurse/Lista-dictionarelor.doc
• written in 3 types of alphabets: Cyrillic, transition and Latin
• large diversity of formatting styles
Bucharest, 14-15 December 2012

The CLRE project
• ISER
• Petri
• DN II
• Bălăşescu
• Lexicon Militar
• Dicționar de informatică

Processing in CLRE
• Scanning
• OCR – ABBYY FineReader 9
• Parsing entries => XML
• Manual verification
• Indexing and alignment
Iași, 25-26 September 2013

CLRE – manual verification

Why would one want to connect linguistic resources?
• Use case 2: align WN with an explanatory dictionary
  – a WN synset: pos (def, ex, w1s1 … wksk … wnsn)
  – an explanatory dictionary entry: wk, pos, … …

Why would one want to connect linguistic resources?
• Use case 2: align WN with an explanatory dictionary
  – synsets of a word wk, pos: (def1, ex1, … wks1 …) … (defk, exk, … wksk …) … (defm, exm, … wksm …)
  – the explanatory dictionary entry of the word wk, pos: wk, pos, … …

Why would one want to connect linguistic resources?
• Use case 2: align WN with an explanatory dictionary
  – a WN synset: pos (def, ex, w1s1 … wksk … wnsn)
  – explanatory dictionary entries: w1, pos, … … wk, pos, … … wn, pos, … …

Why would one want to connect linguistic resources?
• Use case 3: the TOT problem or the forgotten word

[Figure: lexical access pipeline for the forgotten-word problem. Recoverable labels:]
• Panels: A: entire lexicon (hypothetical lexicon containing 60 000 words); B: reduced search-space; C: categorial tree; D: chosen word
• Step-1, user: provide input, say, ‘coffee’
• Step-1, system builder: create the associative network, 1° via computation, 2° via a resource (E.A.T., collocations derived from corpora), 3° via a combination of resources (WordNet, Roget, Named Entities, …). Given some input, the system displays all directly associated words, i.e. direct neighbours (graph), ordered by some criterion or not
• Step-2, system builder: clustering + labeling of the search-space (B); potential categories (nodes) for the words displayed: beverage, food, color, used for, used with, quality, origin, place
• Pre-processing and post-processing: 1° ambiguity detection via WN (coffee: ‘beverage’ or ‘color’?); 2° disambiguation via clustering, or interactively
• Step-2, user: navigation + choice: 1° navigate in the tree and determine whether it contains the target or a more or less related word; 2° decide on the next action: stop here, or continue
• Terms associated with the input ‘coffee’ include TEA, CUP, BLACK, ESPRESSO, MILK, SUGAR, MOCHA, CAPPUCINO

Tree designed for navigational purposes (reduction of search-space). The leaves contain potential target words and the nodes the names of their categories, allowing the user to look only under the relevant part of the tree. Since words are grouped in named clusters, the user does not have to go through the whole list of words anymore. Rather he navigates in a tree (top-to-bottom, left to right), choosing first the category and then its members, to check whether any of them corresponds to the desired target word.

The first thought: standardisation
• Lexical Markup Framework (LMF)
  – What is it?
    • a common model for creation and use of lexical resources
  – With what goal?
    • to manage the exchange of data between and among these resources
    • to enable the merging of a large number of individual electronic resources to form extensive global electronic resources

Near-standard
• Text Encoding Initiative (TEI)
  – What is it?
    • an inventory of the features most often deployed for computer-based text processing
    • recommendations about suitable ways of representing these features
  – With what goal?
    • to facilitate processing by computer programs
    • to facilitate the loss-free interchange of data amongst individuals and research groups using different programs, computer systems, or application software

Standardisation
• Text Encoding Initiative (TEI)
  – Example of a dictionary entry serialisation (from TEI Guidelines)
    disproof (dIs"pru:f) n 1 facts that disprove something 2 the act of disproving CED
    disproof dIs"pru:f n facts that disprove something the act of disproving

LMF and TEI content and… discontent
• The TEI format may be used as an interchange format, permitting sharing of resources even when their local encoding schemes differ
• Both LMF and TEI model lexical material in deep representational detail…

LMF and TEI content and… discontent
• TEI intention:
  – guidance for individual or local practice in text creation and data capture
  – support of data interchange
  – support of application-independent local processing
• Opening good possibilities of querying
• But how would the interconnection work?
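
The querying possibilities mentioned for TEI can be illustrated with a minimal sketch. It assumes that the "disproof" entry is marked up with the TEI dictionary elements entry, form, orth, pron, gramGrp, pos, sense and def; the exact markup shown on the slide is not preserved here, so the XML string below is a reconstruction for illustration only, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Assumed TEI-style markup for the "disproof" entry (illustrative reconstruction,
# without the TEI namespace that a real document would declare).
TEI_ENTRY = """
<entry>
  <form>
    <orth>disproof</orth>
    <pron>dIs"pru:f</pron>
  </form>
  <gramGrp><pos>n</pos></gramGrp>
  <sense n="1"><def>facts that disprove something</def></sense>
  <sense n="2"><def>the act of disproving</def></sense>
</entry>
"""

entry = ET.fromstring(TEI_ENTRY)

# Simple path queries over the hierarchical entry.
headword = entry.findtext("form/orth")
pos = entry.findtext("gramGrp/pos")

# Collect all senses, keyed by their sense number.
senses = {s.get("n"): s.findtext("def") for s in entry.findall("sense")}

# An attribute predicate selects one particular sense.
second_sense = entry.find("sense[@n='2']/def").text

print(headword, pos, senses, second_sense)
```

Queries of this kind work well inside one resource; they do not by themselves say how two differently encoded resources should be connected, which is the question the slides turn to next.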
Parameterising the needs
• If I want to connect two resources, simply merge the contents
• Then be able to interrogate the merged resource by taking advantage of peculiarities in each resource

Parameterising the needs
• Able to represent variations in word forms, alternate orthography, diachronic morphology
• Easy navigation by applying various filtering criteria

Parameterising the needs
• Very often lexicographic data is hierarchical
  – for instance, a sense of a dictionary entry contains a definition, examples, but also sub-senses
• Organise even recursive searches
  – give me the definition neighbouring sphere of depth 2 of the word captain (take all senses of the entry captain and form the list of words in the corresponding definitions, then for each of them take all their senses and collect again words in their definitions)

The idea
• Representing lexical information as feature structures centred on words’ lemmas
  – disproof (dIs"pru:f) n 1 facts that disprove something 2 the act of disproving CED
    [lemma=disproof, entry=[pron=dIs"pru:f, pos=n, sense=[n=1, def=facts that disprove something], sense=[n=2, def=the act of disproving], res=CED]]

Representing lexical entries as feature structures
(Cambridge English Dictionary)
lemma = disproof
entry = [ pron = dIs"pru:f
          pos = n
          sense = [ n = 1, def = facts that disprove something ]
          sense = [ n = 2, def = the act of disproving ]
          res = CED ]

Representing lexical entries as feature structures
• Graph representation
[Figure: the CED entry drawn as a graph, with edges labelled lemma, entry, pron, pos, sense, n, def and res]

Representing lexical entries as feature structures
(Merriam-Webster’s Collegiate Dictionary)
lemma = disproof
entry = [ pron = dIs"pru:f
          pos = n
          sense = [ n = 1, def = the action of disproving ]
          sense = [ n = 2, def = evidence that disproves ]
          res = MWCD ]

Representing lexical entries as feature structures
• Graph representation
[Figure: the MWCD entry drawn as a graph, with the same edge labels]

How could lexical entries be merged?
• Entries of the same word from different dictionaries
[Figure: the MWCD and CED graphs for disproof, side by side]

Merging lexical entries
• Distinct parts
[Figure: the MWCD and CED graphs for disproof, side by side]

Merging lexical entries
[Figure: merged graph. The lemma, pron and pos nodes are shared; under entry, two new nodes X hold the senses and res of MWCD and of CED respectively]

Representation of the merged feature structure
lemma = disproof
entry = [ pron = dIs"pru:f
          pos = n
          X = [ sense = [ n = 1, def = facts that disprove something ]
                sense = [ n = 2, def = the act of disproving ]
                res = CED ]
          X = [ sense = [ n = 1, def = the action of disproving ]
                sense = [ n = 2, def = evidence that disproves ]
                res = MWCD ] ]

The WordNet search for disproof

Feature structures representation for the WN synsets of disproof
lemma = disproof
pos = n
synsets = [ synset = [ lex = (*, falsification, refutation)
                       gloss = any evidence that helps to establish the falsity of something ]
            synset = [ lex = (falsification, falsifying, *, refutation, refutal)
                       gloss = the act of determining that something is false ] ]

The WordNet search for discount
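
The merge of the CED and MWCD entries for disproof can be sketched in a few lines. This is only an illustration under stated assumptions: feature structures are modelled as nested Python dictionaries, the repeated sense feature is stored as a list, and the function name merge_entries is hypothetical. Features whose values coincide are shared, and whatever differs is grouped under one new X node per source, as in the merged structure shown in the slides.

```python
def merge_entries(a, b):
    """Merge two feature structures describing the same lemma.

    Features of the two entries with identical values are kept once;
    the remaining features of each source go under its own new "X" node.
    """
    if a["lemma"] != b["lemma"]:
        raise ValueError("entries describe different lemmas")
    entry_a, entry_b = a["entry"], b["entry"]
    # Common part: same feature name and same value in both entries.
    shared = {k: v for k, v in entry_a.items() if k in entry_b and entry_b[k] == v}
    # Distinct parts: everything that is not shared, kept per source.
    rest_a = {k: v for k, v in entry_a.items() if k not in shared}
    rest_b = {k: v for k, v in entry_b.items() if k not in shared}
    return {"lemma": a["lemma"], "entry": {**shared, "X": [rest_a, rest_b]}}


ced = {
    "lemma": "disproof",
    "entry": {
        "pron": 'dIs"pru:f',
        "pos": "n",
        "sense": [
            {"n": 1, "def": "facts that disprove something"},
            {"n": 2, "def": "the act of disproving"},
        ],
        "res": "CED",
    },
}
mwcd = {
    "lemma": "disproof",
    "entry": {
        "pron": 'dIs"pru:f',
        "pos": "n",
        "sense": [
            {"n": 1, "def": "the action of disproving"},
            {"n": 2, "def": "evidence that disproves"},
        ],
        "res": "MWCD",
    },
}

merged = merge_entries(ced, mwcd)
print(merged)
```

Keeping the source label res inside each X node means a later query can still tell which dictionary a given definition came from.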
Representing WordNet synsets
lemma = discount
[ pos = n
  synsets = [ synset = [ lex = (*, price reduction, deduction)
                         gloss = the act of reducing the selling price of merchandise ]
              synset = [ lex = (discount rate, *, bank discount)
                         gloss = interest on an annual basis deducted in advance on a loan ]
              synset … ] ]
[ pos = v
  synsets = [ synset = [ lex = (dismiss, disregard, brush aside, brush off, *, push aside, ignore)
                         gloss = bar from attention or consideration
                         ex = “She dismissed his advances” ]
              synset … ] ]

How could dictionary entries be merged with WN synsets?
[Figure: the CED graph for disproof and the graph of its WN synsets, shown separately]

Merging a dictionary entry with a WN entry
[Figure: the same two graphs, brought together]

Merging a dictionary entry with a WN entry
[Figure: merged graph. A single lemma node disproof carries both the CED entry (pron, pos, senses, res) and the WN pos and synsets]

Going one step further
• Feature structures are hierarchical data
• Codd: hierarchical data can be represented as relational tables
  – Codd, E. F. (June 1970). "A Relational Model of Data for Large Shared Data Banks". Communications of the ACM 13 (6): 377–387

Representing feature structures as relational tables
[Figure: the feature-structure graph (w, lemma, entry, pron, pos, sense, n, def, cit, orth, auth, yr, res) mapped step by step onto relational tables; table illustration from http://en.wikipedia.org/wiki/Relational_database]
Tables introduced, one per slide:
• WORD (id, lemma, entry)
• ENTRY (entry, pron, pos, sense, res)
• SENSE (sense, n, def, cit)
• CIT (cit, orth, auth, yr)

Relational operators
• Projection: π_{a1,…,an}(R) => a relation containing only the values of attributes a1, …, an from the relation R
• Selection: σ_φ(R), where φ is a logical condition => only tuples verifying the condition φ are retained from the relation (or the set) R
• Join: R ⋈ S => the set of all combinations of tuples in R and S that are equal on their common attributes
• Union: R ∪ S => a table representing the union of the two relations

Interrogating a dictionary
• Citations before 1850 of the entry “symphony”
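
A query of this kind can be tried out with Python's built-in sqlite3 module. The sketch below is an illustration, not the author's implementation: the table names follow the slides, the key columns entry_id, sense_id and cit_id are assumed names, and every row is invented toy data. Projection becomes the SELECT list, selection the WHERE clause, and the join is written with JOIN ... USING on the shared key columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE word  (lemma TEXT, entry_id INTEGER);
CREATE TABLE entry (entry_id INTEGER, pron TEXT, pos TEXT, sense_id INTEGER, res TEXT);
CREATE TABLE sense (sense_id INTEGER, n INTEGER, def TEXT, cit_id INTEGER);
CREATE TABLE cit   (cit_id INTEGER, orth TEXT, auth TEXT, yr INTEGER);
""")

# Invented toy rows, for illustration only.
conn.executemany("INSERT INTO word VALUES (?, ?)",
                 [("symphony", 1), ("captain", 2)])
conn.executemany("INSERT INTO entry VALUES (?, ?, ?, ?, ?)",
                 [(1, None, "n", 10, "toy"), (1, None, "n", 11, "toy"),
                  (2, None, "n", 20, "toy")])
conn.executemany("INSERT INTO sense VALUES (?, ?, ?, ?)",
                 [(10, 1, "placeholder definition", 100),
                  (11, 2, "placeholder definition", 101),
                  (20, 1, "placeholder definition", 200)])
conn.executemany("INSERT INTO cit VALUES (?, ?, ?, ?)",
                 [(100, "early citation", "author A", 1801),
                  (101, "late citation", "author B", 1902),
                  (200, "other word citation", "author C", 1820)])

# Projection on orth, selection on lemma and yr, join over the four tables.
query = """
SELECT orth
FROM word
JOIN entry USING (entry_id)
JOIN sense USING (sense_id)
JOIN cit   USING (cit_id)
WHERE lemma = 'symphony' AND yr < 1850
"""
rows = [r[0] for r in conn.execute(query)]
print(rows)
```

In this toy layout the ENTRY table repeats one row per sense, which mirrors the column lists on the slides; a normalised schema would more likely store the entry key in the SENSE table instead.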
π_orth(σ_{lemma=“symphony” & yr<1850}(WORD ⋈ ENTRY ⋈ SENSE ⋈ CIT))

– TEI representation => as feature structures => hierarchical graphs => relational tables
– more resources => unifications of tables
– use query and relational operators for interrogation

Discussion
• Only a sketch
  – a lot of details should still be filled in
  – the good news: XML structures (the native language of TEI) accept direct representations as database records: XSLT => opening direct access to a complex querying language: XQuery => mimicking the relational operators and adding more facilities

Discussion
• More good news
  – representing variable depth structures – recursive hierarchies: Kamfonas
    • “Fixed depth dimensions are simpler to implement, maintain and query… Hierarchies that have variable depth or an uncertain number of levels… can often benefit if implemented as recursive hierarchies” http://www.kamfonas.com/id3.html

Discussion
• Even more good news (hopes)
  – interrogations can be formulated in natural language => an interpreter translates them into the query language of a DBMS
  – as such, a handy tool for the benefit of lexicographers

Acknowledgements
• Work partially supported by the project The Computational Representative Corpus of Contemporary Romanian Language, a project of the Romanian Academy, and partially by the COST-ENeL project
• I thank Isabelle Tamba and Mădălin Pătrașcu for the slides describing the CLRE project

Thank you!